
Aggregate GPU task metrics in the profiling tool#2088

Merged
parthosa merged 5 commits into NVIDIA:dev from parthosa:rapids-tools-2020
May 4, 2026

@parthosa (Collaborator) commented Apr 27, 2026

Contributes #2020

Changes

1. New GPU task-metric aggregation CSVs at three levels

Adds long-format aggregations for the 26 GPU task accumulators emitted by the RAPIDS plugin (GpuTaskMetrics.scala). Today these are only available raw in stage_level_all_metrics.csv.

| Level | Filename | Columns |
|-------|----------|---------|
| Stage | gpu_stage_level_aggregated_task_metrics.csv | stageId, numTasks, metricName, unit, sum, max, avg |
| SQL | gpu_sql_level_aggregated_task_metrics.csv | sqlId, metricName, unit, sum, max, avg |
| App | gpu_app_level_aggregated_task_metrics.csv | appId, metricName, unit, sum, max, avg |

Note:

  • Job level skipped (each Spark action is a job — rows would either duplicate the SQL row or be meaningless setup/collect jobs).
  • numTasks only at stage level, where it varies; at SQL/app it would be a constant per row (already in sql_level_aggregated_task_metrics.csv).
  • CSVs not generated when no GPU metrics are present.

Example (SQL row):

sqlId,metricName,unit,sum,max,avg
24,gpuTime,ms,86643,897,237
24,gpuMaxDeviceMemoryBytes,bytes,,10124115675,

4. Max-aggregated metrics: AccumMetaRef.METRICS_WITH_MAX_AGGREGATES extended from 4 → 9 entries. For these, sum and avg are emitted empty; only max is meaningful.
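
A minimal sketch of how a max-only row could serialize to the CSV shape shown in the example above. `GpuSqlRow`, `build`, and `maxOnlyMetrics` are hypothetical names for illustration only; the PR's actual case classes live in ProfileClassWarehouse.scala and the real set is AccumMetaRef.METRICS_WITH_MAX_AGGREGATES, whose full contents are not listed here.

```scala
// Hypothetical row type for illustration; not the PR's actual case class.
final case class GpuSqlRow(
    sqlId: Long,
    metricName: String,
    unit: String,
    sum: Option[Long],
    max: Option[Long],
    avg: Option[Long]) {

  // An absent value is rendered as an empty CSV field.
  def toCsv: String = {
    def cell(v: Option[Long]): String = v.map(_.toString).getOrElse("")
    Seq(sqlId.toString, metricName, unit, cell(sum), cell(max), cell(avg)).mkString(",")
  }
}

object GpuSqlRow {
  // Illustrative subset taken from the example row above, not the full 9-entry set.
  val maxOnlyMetrics: Set[String] = Set("gpuMaxDeviceMemoryBytes")

  // For max-only metrics, drop sum and avg so only max is emitted.
  def build(sqlId: Long, name: String, unit: String,
      sum: Long, max: Long, avg: Long): GpuSqlRow =
    if (maxOnlyMetrics.contains(name)) GpuSqlRow(sqlId, name, unit, None, Some(max), None)
    else GpuSqlRow(sqlId, name, unit, Some(sum), Some(max), Some(avg))
}
```

Modeling sum and avg as `Option[Long]` keeps "not meaningful" distinct from a genuine zero, which a consumer of the CSV would otherwise be unable to tell apart.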

Testing

  • AnalysisSuite — three new tests: rows produced for GPU log + rollup math; max-aggregated metrics carry only max; empty for CPU-only log.
  • E2E smoke on core/src/test/resources/spark-events-profiling/gpu_oom_eventlog.zstd: all three CSVs produced; cross-level math verifies.

Emits three new long-format CSVs covering the 26 GPU task accumulators
from GpuTaskMetrics.scala (gpu_stage_/sql_/app_level_aggregated_task_metrics.csv).
Auto-discovery by name (gpu*, perfio.s3.*, multithreadReaderMaxParallelism);
units derived from the name (Time/Wait→ms, Bytes→bytes, else count); SQL/app
levels re-sum stage rows. Skips emission when no GPU metrics are present.
Job level intentionally skipped (each Spark action is a job — would either
duplicate the SQL row or be meaningless).

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
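
The name-based discovery and unit derivation described in the commit message above could look roughly like the sketch below. The object name, the use of `startsWith`/`contains`, and the order in which the rules are checked are assumptions; only the name patterns and the Time/Wait, Bytes, and count mapping come from the commit message.

```scala
// Hypothetical helper for illustration; the PR's real matching logic may differ.
object GpuMetricNames {
  // Name patterns listed in the commit message: gpu*, perfio.s3.*,
  // and multithreadReaderMaxParallelism.
  def isGpuMetric(name: String): Boolean =
    name.startsWith("gpu") ||
      name.startsWith("perfio.s3.") ||
      name == "multithreadReaderMaxParallelism"

  // Assumed rule order: Time/Wait first, then Bytes, otherwise a plain count.
  def unitOf(name: String): String =
    if (name.contains("Time") || name.contains("Wait")) "ms"
    else if (name.contains("Bytes")) "bytes"
    else "count"
}
```

Under these assumed rules, `gpuTime` maps to `ms` and `gpuMaxDeviceMemoryBytes` maps to `bytes`, which matches the units in the example SQL rows.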
@github-actions github-actions Bot added the core_tools Scope the core module (scala) label Apr 27, 2026
@parthosa parthosa self-assigned this Apr 27, 2026
parthosa and others added 2 commits April 27, 2026 09:48
Adds appId as the leading column on gpu_app_level_aggregated_task_metrics.csv
so downstream consumers can join by application without relying on the output
directory path. Also bumps the copyright year on touched files to 2026 (the
pre-commit hook's sed is BSD-incompatible on macOS and silently no-ops).

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
ArrayBuffer.flatMap returns ArrayBuffer (mutable), which no longer
auto-coerces to immutable.Seq under Scala 2.13. Materialize the per-SQL
row collection as Seq before passing to rollupGpuRows, and use an explicit
lambda for the inner flatMap.

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
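
The Scala 2.13 issue in the commit above can be reproduced in isolation. The following is a sketch with hypothetical names (`rollup` stands in for rollupGpuRows and takes plain `Long` values rather than the PR's row type); it shows only the collection-type behavior, not the PR's code.

```scala
import scala.collection.mutable.ArrayBuffer

object SeqCoercion {
  // Hypothetical stand-in for rollupGpuRows: all that matters here is the Seq parameter.
  def rollup(rows: Seq[Long]): Long = rows.sum

  def perSqlTotal(stageIds: ArrayBuffer[Int], rowsByStage: Map[Int, Seq[Long]]): Long = {
    // flatMap on an ArrayBuffer yields another (mutable) ArrayBuffer.
    // The lambda is written out explicitly, as in the commit.
    val collected: ArrayBuffer[Long] =
      stageIds.flatMap(id => rowsByStage.getOrElse(id, Seq.empty[Long]))
    // Under Scala 2.13, scala.Seq is immutable.Seq, so passing `collected`
    // directly would not compile. toSeq yields a Seq on both 2.12 and 2.13.
    rollup(collected.toSeq)
  }
}
```

Calling `.toSeq` at the boundary keeps the code source-compatible across both Scala versions without changing the signature of the rollup function.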
@parthosa parthosa marked this pull request as ready for review April 27, 2026 20:13

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR adds long-format GPU task metric aggregations at three granularities (stage, SQL, app) by mining the existing GPU accumulators in app.accumManager. The rollup math — task-weighted averages, sum propagation, and max-only handling for 9 known max-aggregate metrics — is correctly implemented and well-tested.

The two previously flagged concerns remain open: the index parameter is still unused in aggregateGpuMetricsBySql / aggregateGpuMetricsByApp, and QualRawReportGenerator still emits all three GPU labels unconditionally (unlike the nonEmpty-guarded writes in Profiler.scala).

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/consistency issues that don't affect correctness.

All logic reviewed: rolling-average usage of stats.med is confirmed correct (AccumInfo uses running-mean semantics, with a TODO to rename it); weighted-average formula in rollupGpuRows is arithmetically sound; numTasks=0 fallback now emits a warning; CSV guards in Profiler.scala are correct. Only pre-flagged P2s remain open.

QualRawReportGenerator.scala — unconditional GPU label emission differs from Profiler.scala's nonEmpty guard.
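
For readers unfamiliar with the rollup math the summary refers to, here is a minimal sketch of a task-weighted rollup from stage rows to a higher level. All names (`StageRow`, `Rolled`, `rollup`) are hypothetical, and the integer-division rounding is an assumption; the PR's rollupGpuRows may differ in both.

```scala
object GpuRollup {
  // Hypothetical stage-level row; fields follow the stage CSV columns.
  final case class StageRow(
      stageId: Int,
      numTasks: Int,
      metricName: String,
      sum: Option[Long],
      max: Option[Long],
      avg: Option[Long])

  final case class Rolled(
      metricName: String,
      sum: Option[Long],
      max: Option[Long],
      avg: Option[Long])

  def rollup(rows: Seq[StageRow]): Seq[Rolled] =
    rows.groupBy(_.metricName).toSeq.sortBy(_._1).map { case (name, rs) =>
      val sums = rs.flatMap(_.sum)
      val maxes = rs.flatMap(_.max)
      // Weight each stage's avg by its task count; max-only rows contribute nothing.
      val weighted = rs.flatMap(r => r.avg.map(a => (a * r.numTasks, r.numTasks.toLong)))
      val totalTasks = weighted.map(_._2).sum
      Rolled(
        name,
        if (sums.isEmpty) None else Some(sums.sum),
        if (maxes.isEmpty) None else Some(maxes.max),
        // Integer division truncates; the real rounding behavior is not stated in the PR.
        if (totalTasks == 0L) None else Some(weighted.map(_._1).sum / totalTasks))
    }
}
```

In this sketch a stage that reports numTasks=0 adds zero weight to the average, which illustrates why a missed task-count lookup would distort the SQL and app level avg rather than fail loudly.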

Important Files Changed

| Filename | Overview |
|----------|----------|
| core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala | Core of the PR: adds GPU metric aggregation at stage/SQL/app levels. stats.med holds the running average (AccumInfo carries a TODO to rename it), the rollupGpuRows weighted-avg formula is correct, and the numTasks=0 fallback now emits a warning. |
| core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumMetaRef.scala | Extends METRICS_WITH_MAX_AGGREGATES from 4 to 9 entries; new entries are consistent with max-aggregate semantics and match RAPIDS plugin GpuTaskMetrics. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala | Adds three new case classes (StageAggGpuMetricsProfileResult, SQLAggGpuMetricsProfileResult, AppAggGpuMetricsProfileResult) with correct Option[Long] fields and CSV serialisation. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala | GPU labels added unconditionally to the output map, unlike the nonEmpty-guarded writes in Profiler.scala; CPU-only apps will emit empty GPU entries in the qual path. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala | Three new tests cover GPU log, max-aggregated metric semantics, and CPU-only empty output; rollup math assertions are thorough. |
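
The difference between guarded and unconditional emission noted for QualRawReportGenerator.scala can be illustrated with a small sketch. `GuardedOutputs` and `nonEmptyOutputs` are hypothetical names, and rows are modeled as plain strings; this is not the code in Profiler.scala.

```scala
// Hypothetical helper for illustration only.
object GuardedOutputs {
  // Keep only the labels whose row collections are nonEmpty, so a CPU-only
  // application produces no GPU entries at all.
  def nonEmptyOutputs(candidates: Seq[(String, Seq[String])]): Map[String, Seq[String]] =
    candidates.filter { case (_, rows) => rows.nonEmpty }.toMap
}
```

An unconditional version would simply call `candidates.toMap`, which is how empty GPU entries could end up in the output for CPU-only apps.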

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[app.accumManager] -->|filter isGpuMetric| B[GPU Accumulators]
    B -->|calculateAccStatsForStage| C[aggregateGpuMetricsByStage]
    C -->|StageAggGpuMetricsProfileResult| D[gpuStageRows]
    D -->|groupBy stageId| E[stageMap]
    F[app.sqlIdToStages] --> G[aggregateGpuMetricsBySql]
    E --> G
    G -->|rollupGpuRows per SQL| H[SQLAggGpuMetricsProfileResult]
    D -->|rollupGpuRows all stages| I[aggregateGpuMetricsByApp]
    I --> J[AppAggGpuMetricsProfileResult]
    D --> K[AggRawMetricsResult.gpuStageAggs]
    H --> L[AggRawMetricsResult.gpuSqlAggs]
    J --> M[AggRawMetricsResult.gpuAppAggs]
    K -->|nonEmpty guard| N[gpu_stage_level_aggregated_task_metrics.csv]
    L -->|nonEmpty guard| O[gpu_sql_level_aggregated_task_metrics.csv]
    M -->|nonEmpty guard| P[gpu_app_level_aggregated_task_metrics.csv]

Reviews (3): Last reviewed commit: "Address greptile review on PR #2088"

parthosa and others added 2 commits April 30, 2026 16:14
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

# Conflicts:
#	core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AggRawMetricsResult.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/views/RawMetricProfView.scala
- Drop dead StageAggGpuMetricsProfileResult.aggregateStageProfileMetric.
  Stage attempts already merge upstream at the AccumInfo layer
  (stagesStatMap is keyed by stageId only, not stageId+attemptNumber),
  so a separate merge step on the case class is never invoked. Replaced
  the method with a comment explaining the upstream merging.
- Document the numTasks=0 invariant in aggregateGpuMetricsByStage and
  log a warning if the stage-task metrics cache lookup misses (which
  would silently distort the task-weighted avg at SQL/app level).

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@hirakendu (Collaborator) left a comment

LGTM, super useful!

@parthosa parthosa merged commit 4d88577 into NVIDIA:dev May 4, 2026
17 checks passed

Labels

core_tools Scope the core module (scala)


3 participants